Group Members:

Table of Contents

Introduction

Customer classification is the practice of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing, such as age, gender, interests and spending habits.

Companies using customer classification operate on the premise that every customer is different and that marketing efforts are better served by targeting specific, smaller groups with messages those consumers find relevant and that prompt them to buy. Companies also hope to gain a deeper understanding of their customers' preferences and needs, discovering what each class finds most valuable so that marketing materials can be tailored more accurately to that class. In this lab, logistic regression is implemented to predict a continuous value, which is then thresholded to produce a binary output for the classification task.

1. Preparation and Overview

In this lab, customer classification is performed on an automobile customer segmentation dataset:
URL: https://www.kaggle.com/datasets/abisheksudarshan/customer-segmentation?select=train.csv

1.1 Data cleaning & quality checking

loading data and necessary library packages

Defining our attributes, their types and descriptions:

| Attribute | Attr. type | Description |
| --- | --- | --- |
| ID | Nominal | Identification number of the individual surveyed |
| Gender | Nominal | Sex of the individual surveyed (male or female) |
| Ever_Married | Nominal | Whether the individual is or was previously married |
| Age | Ratio | Age of the individual surveyed |
| Graduated | Nominal | Whether the individual has graduated |
| Profession | Nominal | The individual's reported profession type |
| Work_Experience | Ratio | Reported years of work experience |
| Spending_Score | Ordinal | Reported spending capacity |
| Family_Size | Ratio | Reported family size |
| Var_1 | Nominal | Anonymised category for the customer |
| Segmentation | Nominal | Our classification target labels |

checking for duplicated values and printing the value types and counts


The ID column is removed because it does not contain any useful information. The dataset's reference page gives no exact information about the Var_1 column or what its categories mean, so to prevent confusion I removed it as well.


Based on the dataframe information, missing values are found in the columns listed above. These gaps may have occurred because some customers did not want to share that information with the company.

As the purpose of this survey is to find which car category, in a specific price range, is suitable for a customer to purchase, I believe work experience and profession are significant factors for classification because they have a direct impact on a customer's financial situation. I therefore drop the rows with blanks in these two columns and impute values for the other columns' missing entries. To prepare the categorical features, one-hot encoding will be used: a technique that converts categorical variables into a numeric format usable by predictive models, which helps avoid the bias an arbitrary integer coding of categories would introduce. We also need to normalize our numerical data to a common scale, so that different features can be compared on an equal footing.

replacing categorical features with numerical indicators
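The encoding and scaling steps above can be sketched as follows. This is not the notebook's actual code; the small dataframe is a toy stand-in using a few of the column names from the attribute table:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the Kaggle frame (column names follow the table above).
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Profession": ["Artist", "Doctor", "Artist", "Lawyer"],
    "Age": [22.0, 38.0, 67.0, 45.0],
    "Work_Experience": [1.0, 9.0, 1.0, 0.0],
})

# One-hot encode the categorical columns; drop_first removes one redundant
# dummy per feature so the indicators are not linearly dependent.
encoded = pd.get_dummies(df, columns=["Gender", "Profession"], drop_first=True)

# Normalize the numeric columns to zero mean / unit variance.
num_cols = ["Age", "Work_Experience"]
encoded[num_cols] = StandardScaler().fit_transform(encoded[num_cols])
```

After this step every column is numeric, so the imputation and regression code below can operate on the frame directly.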

1.2 Imputation Techniques

Let's try two methods of imputation on the remaining columns with missing values to see which one works better:

I will use median imputation for the missing values, since the median is robust to outliers.

1.2-1 Split-Impute-Combine
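A minimal sketch of the split-impute-combine idea. Grouping by the Segmentation column is my assumption for illustration (the notebook does not state the grouping key): split the rows into groups, fill each group's gaps with that group's median, and recombine.

```python
import pandas as pd

# Toy data: Family_Size has gaps; impute the median *within* each segment,
# then recombine -- the "split-impute-combine" idea.
df = pd.DataFrame({
    "Segmentation": ["A", "A", "A", "B", "B", "B"],
    "Family_Size": [2.0, None, 4.0, 6.0, 6.0, None],
})

# groupby + transform does the split and combine for us; fillna with the
# group's own median does the impute step.
df["Family_Size"] = (
    df.groupby("Segmentation")["Family_Size"]
      .transform(lambda s: s.fillna(s.median()))
)
```

Group A's median (3.0) fills A's gap, group B's median (6.0) fills B's, so each group's statistics stay internally consistent.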

1.2-2 K Nearest Neighbor Imputation with Scikit-learn
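The scikit-learn KNN imputer can be sketched like this, on a toy two-column array standing in for the customer features:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Rows are customers, columns are (Age, Work_Experience); np.nan marks a gap.
X = np.array([
    [22.0, 1.0],
    [38.0, np.nan],
    [40.0, 9.0],
    [67.0, 1.0],
])

# Each gap is filled with the average of that feature over the 2 nearest
# rows, where distance is computed on the features both rows have observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

Here the gap in row 1 is filled from its two nearest rows by Age (rows 2 and 0), giving (9 + 1) / 2 = 5.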

It seems that the split-and-combine imputer works better than the KNN imputer.

1.3 PCA for dimensionality reduction & Data Split

Now I split my pandas dataframe into the X part (the data features) and the y part (the target class labels).

We need to reset the index of our pandas dataframe; otherwise the indices are not recognized in the regression part and raise an error.

I used the PCA method for dimensionality reduction.

This graph shows that the lower dimensions contain the main features of the data, which are essential for classification; increasing the dimensionality makes our dataset sparser and harder to classify. The graph also shows that the first 4 principal components capture about 96% of the variance in the data.
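The cumulative explained-variance curve behind such a graph can be produced like this; the synthetic data below is illustrative only, built so most of the variance sits in a few directions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data whose per-feature spreads differ wildly, so a few
# principal components carry almost all the variance.
X = rng.normal(size=(200, 4)) * np.array([10.0, 5.0, 1.0, 0.1])

pca = PCA()
pca.fit(X)
# cumulative[k-1] = fraction of total variance kept by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
```

Plotting `cumulative` against the component count gives the elbow-style graph referred to above; one then keeps the smallest k whose cumulative ratio clears the chosen threshold (here, roughly 96% at 4 components for the lab's data).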

Linear Discriminant Analysis (LDA) is a supervised classification algorithm used to find a linear combination of features that best separates two or more classes of data. It assumes that the data follows a Gaussian distribution and that the covariance matrix of each class is equal.

The goal of LDA is to reduce the dimensionality of the data while preserving as much of the class discriminatory information as possible. It accomplishes this by finding the projection that maximizes the ratio of the between-class variance to the within-class variance.

LDA can be used for both binary and multi-class classification problems. It is commonly used in pattern recognition, face recognition, and bioinformatics.

One of the advantages of LDA is that it can handle high-dimensional data, making it useful in situations where the number of features is larger than the number of samples. LDA can also provide insights into the underlying structure of the data by revealing the most discriminative features.

However, LDA assumes that the classes are normally distributed and have equal covariance matrices, which may not always be the case in real-world data. Additionally, LDA may not perform well when the classes are heavily overlapped or when there are too few samples per class.
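As a concrete illustration of the supervised projection described above (using the iris dataset as a stand-in, not the lab's customer data):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA projects the 4 features down to at most (n_classes - 1) = 2 dimensions
# while maximizing between-class relative to within-class variance.
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)
```

Unlike PCA, the projection uses the class labels, so `X_proj` is chosen for class separation rather than raw variance.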

This graph shows the spread of our data in a 3D plot.
Divide your data into training and testing data using an 80% training and 20% testing split

The size of the dataset plays an important role in determining the appropriate data-splitting strategy. If the dataset is small, it may not be possible to split it into separate training and testing sets without leaving insufficient data for training the model. I think this dataset is large enough to allow a split between training and testing data. The split is random, ensuring that the model is not biased towards a particular subset of the data. The classes also appear nearly balanced between the training and testing sets; if they were not, the model could be biased towards the majority class and perform poorly on the minority class. In addition, the training and testing sets have nearly the same statistical characteristics, so an 80/20 split seems to work well for our data.
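The 80/20 split can be sketched with scikit-learn as below (iris is used here as a stand-in dataset; `stratify` is what keeps the class proportions balanced between the two halves, as discussed above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% train / 20% test; stratify=y keeps class proportions similar in both
# halves, and random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```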

2. Modeling

Logistic regression was originally designed for binary classification tasks, where the goal is to predict a binary outcome (e.g. yes/no, true/false). However, it can also be used for multiclass classification tasks by extending the binary decision framework.

One way to extend logistic regression for multiclass classification is to use the one-vs-all (OvA) or one-vs-rest (OvR) strategy. In this approach, the multiclass classification problem is transformed into several binary classification problems. Specifically, for each class k, a binary classifier is trained to distinguish between observations of class k and observations of all other classes combined.

For example, suppose we have $K$ classes $(K>2)$ and we want to use logistic regression for multiclass classification. We can train $K$ binary classifiers, denoted by $g_1(x), g_2(x),..., g_K(x)$, where each classifier $g_k$ predicts whether an observation $x$ belongs to class $k$ or not. The decision rule for the OvR strategy is to assign the class label with the highest predicted probability among all K classifiers:

$$ \widehat{y} = \underset{k=1,\dots,K}{\operatorname{argmax}} \ g(w^{T}_k x)$$

The probability that an observation x belongs to class k is estimated using the logistic function:

$$ P(y=k \mid x)= \frac{1}{1+\exp(-w_k^T x)}$$

where $w_k$ is the weight vector for the k-th classifier.

The weight vector $w_k$ for the $k_{th}$ classifier is learned by minimizing the logistic regression loss function:

$$ J(w_k) = -\frac{1}{n} \sum_{i=1}^{n}\Bigl[\, y_i \log g(w^{T}_k x_i) + (1-y_i)\log\bigl(1-g(w^{T}_k x_i)\bigr) \Bigr]$$

where $y_i=1$ if the $i$-th observation belongs to class $k$, and $y_i=0$ otherwise. The optimization problem can be solved using gradient descent or other optimization algorithms.

In summary, logistic regression can be extended for multiclass classification using the OvA or OvR strategy, which involves training multiple binary classifiers and combining their outputs to make a multiclass decision.
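The OvR scheme described above can be sketched from scratch: one binary cross-entropy classifier per class, trained by gradient descent, with the argmax decision rule. The tiny 1-D dataset is illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ovr(X, y, n_classes, eta=0.1, epochs=2000):
    """Train one binary logistic classifier per class (one-vs-rest):
    class k's targets are 1 for class k and 0 for everything else."""
    n, d = X.shape
    W = np.zeros((n_classes, d))
    for k in range(n_classes):
        yk = (y == k).astype(float)
        for _ in range(epochs):
            p = sigmoid(X @ W[k])
            W[k] -= eta * X.T @ (p - yk) / n   # cross-entropy gradient step
    return W

def predict_ovr(W, X):
    # Decision rule: argmax over the K per-class sigmoid scores.
    return np.argmax(sigmoid(X @ W.T), axis=1)

# Three well-separated 1-D clusters; the second column is a bias feature.
X = np.array([[x, 1.0] for x in [-5, -4.5, -4, -0.5, 0, 0.5, 4, 4.5, 5]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
W = train_ovr(X, y, n_classes=3)
```

Each row of `W` is one $w_k$ from the equations above; prediction takes the class whose classifier assigns the highest probability.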

2.1 Binary Logistic Regression

2.2 Stochastic Logistic Regression

2.3 Hessian Binary Logistic Regression
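This is not the notebook's implementation, but a generic sketch of the Hessian-based (Newton's method) update for binary logistic regression: each step solves a linear system with the Hessian instead of taking a fixed-size gradient step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logreg(X, y, iters=10, ridge=1e-6):
    """Newton's method for binary logistic regression: each step solves
    H dw = grad, where H = X^T S X / n is the Hessian of the loss and
    S = diag(p(1-p)); a tiny ridge keeps H invertible."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n
        S = p * (1 - p)                       # diagonal of the weight matrix
        H = (X * S[:, None]).T @ X / n + ridge * np.eye(d)
        w -= np.linalg.solve(H, grad)
    return w

# Toy separable problem; the second column is a bias feature.
X = np.array([[x, 1.0] for x in [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w = newton_logreg(X, y)
preds = (sigmoid(X @ w) >= 0.5).astype(int)
```

Newton steps use second-order curvature, so convergence takes far fewer iterations than plain gradient descent, at the cost of solving a d-by-d system per step.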

2.4 BFGS Binary Logistic Regression
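One common way to run BFGS for this problem is `scipy.optimize.minimize` with an explicit gradient; a sketch on toy data (a small L2 term keeps the optimum finite on separable data):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy separable data; the second column is a bias feature.
X = np.array([[x, 1.0] for x in [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

def loss(w, C=0.1):
    """Negative log-likelihood plus an L2 penalty (C is the penalty
    strength here, unlike sklearn's inverse-strength C)."""
    p = sigmoid(X @ w)
    eps = 1e-12
    nll = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return nll + C * np.sum(w ** 2)

def grad(w, C=0.1):
    p = sigmoid(X @ w)
    return X.T @ (p - y) / len(y) + 2 * C * w

res = minimize(loss, x0=np.zeros(2), jac=grad, method="BFGS")
preds = (sigmoid(X @ res.x) >= 0.5).astype(int)
```

BFGS builds an approximation to the inverse Hessian from successive gradients, giving Newton-like convergence without forming the Hessian explicitly.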

2.5 MultiClass Logistic Regression

2.6 SKLearn Logistic Regression
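The scikit-learn baseline used for comparison can be sketched as follows (iris stands in for the customer data here):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# C is the *inverse* regularization strength; the multiclass handling
# (multinomial / one-vs-rest) is taken care of internally.
clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

`score` reports plain accuracy on the held-out split, which is the metric compared across methods in section 2.7.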

2.7 Regression methods comparison

To compare the performance of the regression methods and find optimal values of the tuning parameters, I visualize five regression methods ('BFSBinary', 'Stochastic', 'Binary', 'Hessian', 'SKlearn'). Each time I change just one value, extract the setting giving the maximum accuracy, and use it for the next round of changes. I start by varying eta:

The results show that at eta = 0.01 most of the regression methods reach their maximum accuracy, so for the next experiment I set eta = 0.005 as a constant value and investigate the effect of changing C.

The results show that at C = 0.001 most of the regression methods reach their maximum accuracy, so I use this value for the final models.

3. Modeling-Deployment

The results show a maximum accuracy of 46.5% using the SKlearn regression method, with BFSBinary performing next best. Hessian binary logistic regression can be sensitive to small changes: in some cases the covariance matrix is not well estimated, which can lead to biased results. Stochastic logistic regression can be noisier than batch logistic regression because it updates the model parameters using small subsets of the data, which introduces more variance into the model. It is also more sensitive to the initial parameter values and the learning rate, which can cause convergence issues if they are not set correctly, and it can be less stable than batch logistic regression because the random nature of the updates can lead to oscillations in model performance.

SK_learn generally performs well on our dataset and is the most accurate method on our data. Its advantage is stable, high performance: accuracy does not degrade under drastic parameter changes and does not oscillate. Scikit-learn's implementation of logistic regression is fast and computationally efficient, handles large datasets with ease, can be regularized to prevent overfitting and improve generalization, and can be applied to a wide range of problems, including binary classification, multi-class classification, and probability estimation. Overall, scikit-learn's logistic regression is a reliable and versatile algorithm that produces good results for a variety of classification tasks. Taking all these aspects into consideration, I advise using scikit-learn.

4. Additional analysis

Logistic regression using mean squared error as the objective function (instead of maximum likelihood)

To optimize logistic regression with both L1 and L2 regularization using mean squared error as the objective function, we define the loss function as:

$$L(w) = \frac{1}{2n}\sum_{i=1}^{n}\bigl(y^{(i)}-g(w^{T}x^{(i)})\bigr)^2 + l_1 C\sum_{j=1}^{m}|w_j| + l_2 C \sum_{j=1}^{m}w_j^2$$

where $l_1$ and $l_2$ are the regularization parameters that control the strength of L1 and L2 regularization, respectively, and $m$ is the number of features in the data. The first term is the mean squared error; the second is the L1 penalty, which promotes sparsity in the weight vector; and the third is the L2 penalty, which discourages large weights. Both penalties are added to the loss, so minimizing $L$ shrinks the weights.

The gradient of the loss function with respect to the weight vector $w$ is given by:

$$\nabla L(w) = \frac{1}{n}\sum_{i=1}^{n}\bigl(g(w^{T}x^{(i)})-y^{(i)}\bigr)\, g(w^{T}x^{(i)})\bigl(1-g(w^{T}x^{(i)})\bigr)\, x^{(i)} + l_1 C\,\operatorname{sign}(w) + 2\, l_2 C\, w$$

where $\operatorname{sign}(w)$ is applied element-wise and returns 1 where $w$ is positive, -1 where it is negative, and 0 where it is zero; the extra factor $g(1-g)$ relative to the cross-entropy gradient comes from the chain rule through the sigmoid.

The Hessian of the loss function is given by:

$$H_{ij} = \frac{\partial^2 L(w)}{\partial w_i\,\partial w_j} = \frac{1}{n}\sum_{k=1}^{n} g_k(1-g_k)\bigl[-3g_k^2 + 2g_k + y^{(k)}(2g_k-1)\bigr]\, x^{(k)}_i x^{(k)}_j + 2\, l_2 C\,\delta_{ij}$$

where $g_k = g(w^{T}x^{(k)})$ and $\delta_{ij}$ is the Kronecker delta function, which is 1 when $i=j$ and 0 otherwise. (The L1 term contributes nothing to the Hessian away from $w_j = 0$, where it is not differentiable.)
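Gradient descent on this MSE objective can be sketched as below. This is a generic illustration on toy data, not the notebook's code, and it uses the usual convention that both penalties are added to the loss; the `p * (1 - p)` factor is what distinguishes the MSE gradient from the cross-entropy one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_logreg(X, y, eta=0.5, epochs=2000, l1=0.0, l2=0.0, C=1.0):
    """Gradient descent on the MSE loss with added L1/L2 penalties;
    the chain-rule factor p*(1-p) comes from differentiating through
    the sigmoid."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        p = sigmoid(X @ w)
        grad = X.T @ ((p - y) * p * (1 - p)) / n
        grad += l1 * C * np.sign(w) + 2 * l2 * C * w
        w -= eta * grad
    return w

# Toy separable data; the second column is a bias feature.
X = np.array([[x, 1.0] for x in [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w = mse_logreg(X, y, l2=0.01)
preds = (sigmoid(X @ w) >= 0.5).astype(int)
```

Because `p * (1 - p)` vanishes as predictions saturate, the MSE gradient becomes very small for confident mistakes, which is one practical reason the MLE (cross-entropy) model trains better, as discussed next.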

Our MLE model has higher accuracy than our MSE model. MSE (mean squared error) is a loss function whose natural goal is predicting a continuous target variable: it measures the average squared difference between predicted and actual values. Using MSE as the loss function for logistic regression is less appropriate because the predictions are probabilities bounded between 0 and 1; the squared error of a probability penalizes confident mistakes only mildly and does not provide a suitable measure of how well the model is performing. Instead, logistic regression typically uses a loss function called binary cross-entropy, which measures the difference between the predicted probabilities and the actual binary outcomes and penalizes the model heavily when it makes confidently incorrect predictions (i.e., when the predicted probability is far on the wrong side of 0.5). Accordingly, the logistic regression model is typically trained by maximum likelihood estimation (MLE), which maximizes the likelihood of the observed data given the model parameters.

5. Conclusion

In conclusion, logistic regression is a powerful statistical technique that can be used for customer classification in marketing and customer relationship management. By predicting the probability of a customer belonging to a certain category or segment based on their demographic, behavioral, or transactional characteristics, businesses can identify and target customers who are most likely to respond positively to their marketing efforts and to retain their loyalty over time.

To build an effective logistic regression model for customer classification, it is important to carefully select the outcome and predictor variables, estimate the model using maximum likelihood estimation, and evaluate the model's performance using various metrics. By incorporating regularization techniques such as L1 or L2 regularization, logistic regression can help to reduce overfitting and improve the generalization ability of the model.

Overall, customer classification based on logistic regression is a valuable tool for businesses seeking to improve their marketing and customer retention strategies, and can lead to more efficient and effective use of marketing resources, higher customer satisfaction, and increased revenue.
